First, we will gather some statistics about the corpus that will guide the subsequent preprocessing.
The IMDB corpus contains a total of 100K movie reviews extracted from the IMDB database. This dataset was originally designed for Sentiment Analysis research. For that reason, 50K of the 100K reviews are labeled with polarity (derived from the rating the user gave the movie): 25K of them correspond to positive reviews and the other 25K to negative reviews. The labeled portion is also balanced between training and test samples, 25K for training and 25K for test. The remaining 50K reviews are unlabeled and intended for unsupervised learning experiments, which is our case.
First, let's look at the distribution of reviews per movie. Our goal is to check whether the dataset is balanced in this respect, since movies with a large number of reviews could introduce a bias, in which case we would need a strategy to balance the data ourselves.
Each review comes in an individual .txt file whose name follows the format [id]_[rating].txt. The id is unique per review. For each group of reviews there is a file urls_[pos|neg|unsup].txt in which each line contains the identifier of a movie, with the review id equal to the line number. That is, line 0 of this file contains the identifier (IMDB URL) of the movie whose review is stored in the file 0_[rating].txt.
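The mapping between a review file and its movie can be sketched with a couple of hypothetical helpers (the function names and sample URL are illustrative, assuming the layout just described):

```python
import os
import re

def parse_review_filename(path):
    """Extract (line_no, rating) from a review file named [id]_[rating].txt."""
    match = re.match(r'(\d+)_(\d+)\.txt$', os.path.basename(path))
    return int(match.group(1)), int(match.group(2))

# Each line of urls_*.txt holds an IMDB URL such as
# http://www.imdb.com/title/tt0453418/usercomments
movie_id_pattern = re.compile(r'title/(.*)/')

def movie_id_for_review(path, urls_lines):
    """Return the IMDB movie id for a review file, given the already-read
    lines of the matching urls_*.txt file."""
    line_no, _rating = parse_review_filename(path)
    return movie_id_pattern.search(urls_lines[line_no]).group(1)
```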
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)
import os
from collections import Counter
import re
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pprint
from nltk import FreqDist
import gensim
from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer(stop_words='english', strip_accents='unicode')
analyzer = vectorizer.build_analyzer()
id_pattern = re.compile(r'title/(.*)/')
pp = pprint.PrettyPrinter(width=100, compact=True)
def walk_corpus(path, pattern):
    import fnmatch
    for root, dirnames, filenames in os.walk(path):
        for filename in fnmatch.filter(filenames, pattern):
            yield os.path.join(root, filename)
def corpus_stats(path, pattern):
    cnt = Counter()
    for urls_file in walk_corpus(path, pattern):
        with open(urls_file) as f:
            for line in f:
                movie_id = id_pattern.search(line).group(1)
                cnt[movie_id] += 1
    distinct = len(cnt)
    size = sum(cnt.values())
    print('Number of different movies reviewed: {}'.format(distinct))
    print('Number of total reviews: {}'.format(size))
    print('Reviews per movie average: {}'.format(size / distinct))
    df = pd.DataFrame.from_dict(cnt, orient='index').reset_index()
    ax = df.hist(grid=False)
    ax[0][0].set_xlabel("Number of Reviews", labelpad=20, weight='bold', size=12)
    ax[0][0].set_ylabel("Number of Movies", labelpad=20, weight='bold', size=12)
corpus_stats('./resources/aclImdb/', 'urls_*.txt')
corpus_stats('./resources/aclImdb/train/', 'urls_unsup*')
corpus_stats('./resources/aclImdb/test/', 'urls_*.txt')
In this section we will use NLTK to analyze how the text is distributed across the whole corpus. We are interested in characteristics such as the most frequent tokens, corpus length, vocabulary size, and so on.
def tokenize_corpus(path, pattern, mode='d'):
    for corpus_file in walk_corpus(path, pattern):
        with open(corpus_file, 'r') as next_file:
            next_review = next_file.read()
            tokens = analyzer(next_review)
            if mode == 'd':
                yield tokens
            else:
                for token in tokens:
                    yield token
Most tools that work with text in Python (NLTK, gensim, scikit-learn...) rely on a data structure that implements a frequency distribution, leading to the representation known as Bag of Words (BoW): per document, or globally for the whole corpus, we simply keep a counter of how many times each word or token appears.
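As a minimal standard-library sketch of the idea (the actual pipeline below uses NLTK's FreqDist and gensim's Dictionary; the sample documents are made up):

```python
from collections import Counter

docs = ["a great movie , a great cast", "a terrible movie"]

# Per-document bag of words: each document becomes a token -> count mapping.
bows = [Counter(doc.split()) for doc in docs]

# Corpus-level frequency distribution: the sum of the per-document counters.
corpus_dist = sum(bows, Counter())
```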
%time dist = FreqDist(tokenize_corpus('./resources/aclImdb/all/', '*.txt', mode='t'))
print(dist)
df = pd.DataFrame(dist.most_common(100))
df.columns = ['Token', 'Frequency']
df.head(20)
df = pd.DataFrame(dist.most_common()[-100:])
df.columns = ['Token', 'Frequency']
df.head(20)
Our topic model will be based on a BoW representation of the corpus. We will only take the global term frequencies into account, not a document-level weighting such as TF-IDF. The first thing we need to build is a dictionary holding our vocabulary. We will start with an unfiltered vocabulary, to check what results we obtain and whether our earlier analysis makes sense as a guide for the later filtering.
We now start using gensim to build the dictionary.
stream = tokenize_corpus('./resources/aclImdb/all/', '*.txt')
%time dictionary = gensim.corpora.Dictionary(stream)
dictionary.save('original.dict')
data = [[dictionary.num_docs, dictionary.num_pos, len(dictionary.token2id)]]
df = pd.DataFrame(data)
df.columns = ['Number of reviews analyzed', 'Number of tokens analyzed', 'Number of unique tokens']
df.head()
We need an iterable that provides streaming access to the BoW representation of each of our documents (reviews). gensim will use this iterable efficiently to train the model iteratively over a given number of passes.
class MovieCorpus(object):
    def __init__(self, path, dictionary):
        self.__path = path
        self.__dictionary = dictionary

    def __iter__(self):
        # Stream the BoW representation of each review, one document at a time
        for tokens in tokenize_corpus(self.__path, '*.txt'):
            yield self.__dictionary.doc2bow(tokens)

    def __len__(self):
        # Note: this returns the vocabulary size, not the number of documents
        return len(self.__dictionary)
def explore_topic(lda_model, topic_number, topn, output=True):
    """
    Accept an LdaModel, a topic number and the number of top terms of interest;
    print a formatted list of the topn terms.
    """
    terms = []
    for term, frequency in lda_model.show_topic(topic_number, topn=topn):
        terms += [term]
        if output:
            print(u'{:20} {:.3f}'.format(term, round(frequency, 3)))
    return terms
def print_lda_model(lda_model, num_topics=20):
    topic_summaries = []
    print(u'{:20} {}'.format(u'term', u'frequency') + u'\n')
    for i in range(num_topics):
        print('\nTopic ' + str(i) + ' |---------------------\n')
        tmp = explore_topic(lda_model, topic_number=i, topn=10, output=True)
        topic_summaries += [tmp[:5]]
dictionary = gensim.corpora.Dictionary.load('original.dict')
corpus = MovieCorpus("./resources/aclImdb/all", dictionary)
%time lda_model = gensim.models.ldamodel.LdaModel(corpus, num_topics=20, id2word=dictionary)
print_lda_model(lda_model)
gensim.corpora.MmCorpus.serialize('corpus.mm', corpus)
corpus = gensim.corpora.MmCorpus('corpus.mm')
import pyLDAvis
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
vis
We need to be able to compare the different models we generate, to verify that they improve with the actions we take. There are many ways to evaluate an LDA model; any method suitable for evaluating clusters produced by clustering algorithms can be applied.
Clusters are usually evaluated by measuring the coherence of their components. In our particular case, a topic is of higher quality when its most prominent terms tend to co-occur in the same documents.
Fortunately, gensim provides its own tools for measuring coherence, which we use below.
from gensim.models.coherencemodel import CoherenceModel
cm = CoherenceModel(model=lda_model, corpus=corpus, coherence='u_mass')
cm.get_coherence_per_topic()
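gensim's CoherenceModel also offers get_coherence(), which aggregates the per-topic scores into a single number. When comparing several models by hand, the same aggregation can be sketched as follows (helper names are hypothetical; with u_mass, values closer to zero indicate more coherent topics):

```python
def mean_coherence(per_topic_scores):
    """Average the per-topic u_mass scores into one figure per model."""
    return sum(per_topic_scores) / len(per_topic_scores)

def best_model(scores_by_model):
    """Pick the model name with the highest (closest to zero) mean coherence."""
    return max(scores_by_model, key=lambda name: mean_coherence(scores_by_model[name]))
```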
mc100 = [mc[0] for mc in dist.most_common(100)]
terms_id = lda_model.get_topic_terms(3)
terms_str = [dictionary.id2token[term_id] for term_id, _weight in terms_id if term_id in dictionary.id2token]
list(set(mc100) & set(terms_str))
def dictionary_filter_most_frequent(dictionary, dist, n=200):
    most_common = dist.most_common(n)
    mc_ids = [dictionary.token2id[t[0]] for t in most_common]
    dictionary.filter_tokens(bad_ids=mc_ids)
    dictionary.compactify()
    print('Current vocabulary size: {}'.format(len(dictionary.token2id)))
dictionary_filter_most_frequent(dictionary, dist)
# Filter out words that occur in fewer than 10 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below=10, no_above=0.5)
print('Filtered vocabulary size: {}'.format(len(dictionary.token2id)))
In addition to lowercasing, we will also remove plurals.
def normalize_dictionary(dictionary):
    from textblob import Word
    plurals = []
    for token in dictionary.values():
        if token.endswith('s'):
            singular = Word(token).singularize()
            if token != singular:
                singular_id = dictionary.token2id.get(singular, None)
                # id 0 is falsy, so compare against None explicitly
                if singular_id is not None:
                    plurals.append(dictionary.token2id[token])
    dictionary.filter_tokens(bad_ids=plurals)
    dictionary.compactify()
    return plurals
print('Current number of unique tokens: {}'.format(len(dictionary.token2id)))
plurals = normalize_dictionary(dictionary)
print('Number of plurals detected: {}'.format(len(plurals)))
print('Current number of unique tokens: {}'.format(len(dictionary.token2id)))
dictionary.save('normalized.v1.dict')
dictionary = gensim.corpora.Dictionary.load('normalized.v1.dict')
corpus2 = MovieCorpus("./resources/aclImdb/all", dictionary)
gensim.corpora.MmCorpus.serialize('corpus2.mm', corpus2)
corpus2 = gensim.corpora.MmCorpus('corpus2.mm')
%time lda_model = gensim.models.ldamodel.LdaModel(corpus2, num_topics=20, id2word=dictionary)
print_lda_model(lda_model)
vis = pyLDAvis.gensim.prepare(lda_model, corpus2, dictionary)
vis
cm = CoherenceModel(model=lda_model, corpus=corpus2, coherence='u_mass')
cm.get_coherence_per_topic()
def dictionary_keep_n_frequent(dictionary, dist, n=5000):
    tokens_by_freq = dist.most_common(len(dist))
    mf = []
    for token, _count in tokens_by_freq:
        token_id = dictionary.token2id.get(token, None)
        # id 0 is falsy, so compare against None explicitly
        if token_id is not None:
            mf.append(token_id)
        if len(mf) == n:
            break
    dictionary.filter_tokens(good_ids=mf)
dictionary = gensim.corpora.Dictionary.load('normalized.v1.dict')
dictionary_keep_n_frequent(dictionary, dist)
corpus = MovieCorpus("./resources/aclImdb/all", dictionary)
gensim.corpora.MmCorpus.serialize('corpus3.mm', corpus)
corpus = gensim.corpora.MmCorpus('corpus3.mm')
%time lda_model= gensim.models.ldamodel.LdaModel(corpus, num_topics=10, id2word=dictionary)
print_lda_model(lda_model, 10)
%time lda_model3 = gensim.models.ldamodel.LdaModel(corpus, num_topics=20, id2word=dictionary)
print_lda_model(lda_model3, 20)
vis = pyLDAvis.gensim.prepare(lda_model3, corpus, dictionary)
vis
cm = CoherenceModel(model=lda_model3, corpus=corpus, coherence='u_mass')
cm.get_coherence_per_topic()
%time lda_model= gensim.models.ldamodel.LdaModel(corpus, num_topics=50, id2word=dictionary)
print_lda_model(lda_model, 50)
vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
vis
cm = CoherenceModel(model=lda_model, corpus=corpus, coherence='u_mass')
cm.get_coherence_per_topic()
A first filter we can apply to avoid bias is a cap on the number of reviews allowed per movie. We will use a configurable parameter with an initial value of 10, chosen after studying the first histogram.
ids_by_path = {}
for urls_file in walk_corpus('./resources/aclImdb/all/', 'urls.urls'):
    dirname = os.path.dirname(urls_file)
    with open(urls_file) as f:
        ids_map = {}
        for index, line in enumerate(f):
            movie_id = id_pattern.search(line).group(1)
            ids_map[index] = movie_id
        ids_by_path[dirname] = ids_map
line_id_pattern = re.compile('([0-9]+)_[0-9]+')
def tokenize_corpus(path, pattern, min_df=1, mode='d', limit=10):
    movie_counter = Counter()
    for corpus_file in walk_corpus(path, pattern):
        dirname = os.path.dirname(corpus_file)
        line_id = int(line_id_pattern.search(corpus_file).group(1))
        ids_map = ids_by_path[dirname]
        movie_id = ids_map[line_id]
        # Cap the number of reviews per movie (strict < yields exactly `limit`)
        if movie_counter[movie_id] < limit:
            movie_counter[movie_id] += 1
            with open(corpus_file, 'r') as next_file:
                next_review = next_file.read()
                tokens = analyzer(next_review)
                if mode == 'd':
                    yield tokens
                else:
                    for token in tokens:
                        yield token
%time dist2 = FreqDist(tokenize_corpus('./resources/aclImdb/all/', '*.txt', mode='t'))
print(dist2)
pp.pprint(dist2.most_common(100))
%time dictionary2 = gensim.corpora.Dictionary(tokenize_corpus('./resources/aclImdb/all/', '*.txt'))
data = [[dictionary2.num_docs, dictionary2.num_pos, len(dictionary2.token2id)]]
df = pd.DataFrame(data)
df.columns = ['Number of reviews analyzed', 'Number of tokens analyzed', 'Number of unique tokens']
df.head()
dictionary2.save('limited.dict')
dictionary_filter_most_frequent(dictionary2, dist2)
dictionary2.filter_extremes(no_below=10, no_above=0.5)
plurals = normalize_dictionary(dictionary2)
dictionary2.save('limited.normalized.dict')
corpus = MovieCorpus("./resources/aclImdb/all", dictionary2)
gensim.corpora.MmCorpus.serialize('corpus4.mm', corpus)
corpus = gensim.corpora.MmCorpus('corpus4.mm')
%time lda_model4 = gensim.models.ldamodel.LdaModel(corpus, num_topics=20, id2word=dictionary2)
print_lda_model(lda_model4)
vis = pyLDAvis.gensim.prepare(lda_model4, corpus, dictionary2)
vis
cm = CoherenceModel(model=lda_model4, corpus=corpus, coherence='u_mass')
cm.get_coherence_per_topic()
#dictionary_keep_n_frequent(dictionary2, dist2)
corpus = MovieCorpus("./resources/aclImdb/all", dictionary2)
gensim.corpora.MmCorpus.serialize('corpus5.mm', corpus)
corpus = gensim.corpora.MmCorpus('corpus5.mm')
%time lda_model5 = gensim.models.ldamodel.LdaModel(corpus, num_topics=20, id2word=dictionary2)
lda_model5.save('limited.normalized.filtered.model')
print_lda_model(lda_model5)
vis = pyLDAvis.gensim.prepare(lda_model5, corpus, dictionary2)
vis
cm = CoherenceModel(model=lda_model5, corpus=corpus, coherence='u_mass')
cm.get_coherence_per_topic()
In all the previous models we have seen several kinds of words that recur with considerable frequency across many topics but add little value when categorizing by theme. Examples are proper nouns and verbs, which we can filter out.
def dictionary_filter_neutral(dictionary, polarity=0.5):
    from textblob import TextBlob
    neutrals = []
    for token in dictionary.values():
        if len(token) > 1:
            # Capitalize so TextBlob's POS tagger can detect proper nouns (NNP)
            upper = token[0].upper() + token[1:]
            blob = TextBlob(upper)
            is_neutral = abs(blob.polarity) <= polarity
            tag = blob.pos_tags[0][1]
            if is_neutral and tag != 'NNP' and not tag.startswith('VB'):
                neutrals.append(dictionary.token2id[token])
    # Keep only the neutral, non-proper-noun, non-verb tokens
    dictionary.filter_tokens(good_ids=neutrals)
    dictionary.compactify()
    return neutrals
dictionary = gensim.corpora.Dictionary.load('normalized.v1.dict')
print('Initial vocabulary size: {}'.format(len(dictionary)))
neutrals = dictionary_filter_neutral(dictionary, 0.0)
print("Number of neutral words kept: {}".format(len(neutrals)))
dictionary_keep_n_frequent(dictionary, dist)
corpus = MovieCorpus("./resources/aclImdb/all", dictionary)
gensim.corpora.MmCorpus.serialize('corpus6.mm', corpus)
corpus = gensim.corpora.MmCorpus('corpus6.mm')
%time lda_model_n = gensim.models.ldamodel.LdaModel(corpus, num_topics=20, id2word=dictionary)
print_lda_model(lda_model_n)
lda_model_n.save('neutral.model')
vis = pyLDAvis.gensim.prepare(lda_model_n, corpus, dictionary)
vis
cm = CoherenceModel(model=lda_model_n, corpus=corpus, coherence='u_mass')
cm.get_coherence_per_topic()
dictionary = gensim.corpora.Dictionary.load('normalized.v1.dict')
dictionary_keep_n_frequent(dictionary, dist)
corpus = MovieCorpus("./resources/aclImdb/all", dictionary)
tfidf = gensim.models.TfidfModel(corpus)
%time lda_model6 = gensim.models.ldamodel.LdaModel(tfidf[corpus], num_topics=20, id2word=dictionary)
pp.pprint(lda_model6.print_topics(20))
import requests
import json
r = requests.get("http://www.omdbapi.com/?i=tt0379889&apikey=ccedfaeb")
pp.pprint(r.json())
To analyze the reviews, we first tokenize the text and convert it into a Bag of Words representation projected onto our dictionary. This representation can then be passed to our LDA model, which returns the most probable topic distribution for the input text.
good_review_text = """I just saw this at the Toronto International Film Festival in the beautiful Elgin Theatre.
I was blown away by the beautiful cinematography, the brilliant adaptation of a very tricky play and last
but not least, the bravura performance of Al Pacino, who was born to play this role,
which was perfectly balanced by an equally strong performance from Jeremy Irons.<br /><br />
The film deftly explores the themes of love vs loyalty, law vs justice, and passion vs reason.
Some might protest that the content is inherently anti-semitic,
however they should consider the historical context of the story,
and the delicate and nuanced way in which it is told in this adaptation"""
good_review_tokens = analyzer(good_review_text)
lda_model_n.get_document_topics(dictionary.doc2bow(good_review_tokens))
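get_document_topics returns a list of (topic_id, probability) pairs; a small hypothetical helper can pick out the dominant topic:

```python
def dominant_topic(topic_dist):
    """Return the (topic_id, probability) pair with the highest probability,
    from the list produced by get_document_topics."""
    return max(topic_dist, key=lambda pair: pair[1])
```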
Let's check the 10 most prominent tokens of the topic assigned with the highest probability, topic 11.
def get_topic_tokens(model, topic_id):
    terms = model.show_topic(topic_id)
    return [item[0] for item in terms]
tokens = get_topic_tokens(lda_model_n, 11)
tokens
shared_tokens = list(set(good_review_tokens) & set(tokens))
pd.options.display.max_colwidth = None  # show full sentences without truncation
from IPython.display import display, HTML
def explore_opinions(text, keywords):
    from textblob import TextBlob
    blob = TextBlob(text)
    data = []
    for sentence in blob.sentences:
        for token in keywords:
            if token in sentence.words:
                data.append([token, str(sentence), sentence.sentiment[0], sentence.sentiment[1]])
    df = pd.DataFrame(data)
    df.columns = ['Token', 'Sentence', 'Sentiment Polarity', 'Sentiment Subjectivity']
    return df
display(HTML(explore_opinions(good_review_text, shared_tokens).to_html().replace("\\n","<br>").replace('adaptation', '<strong>adaptation</strong>')))
bad_review_text = """I have to admit that although I'm a fan of Shakespeare,
I was never really familiar with this play. And what I really can't say is whether this is a poor adaptation,
or whether the play is just a bad choice for film.
There are some nice pieces of business in it, but the execution is very clunky and the plot is obvious.
The theme of the play is on the nature of debt, using the financial idea of debt and justice as a
metaphor for emotional questions. That becomes clear when the issue of the rings becomes more important than
the business with Shylock, which unfortunately descends into garden variety anti-Semitisim despite
the Bard's best attempts to salvage him with a couple nice monologues.<br /><br />
Outside of Jeremy Irons' dignified turn, I didn't think there was a decent performance in the bunch.
Pacino's Yiddish consists of a slight whine added to the end of every pronouncement, and
some of the better Shylock scenes are reduced to variations on the standard "Pacino gets angry"
scene that his fans know and love. But Lynn Collins is outright embarrassing, to the point where I
would have thought they would have screen-tested her right out of the picture early on.
When she goes incognito as a man, it's hard not to laugh at all the things we're not supposed to laugh at.
With Joseph Fiennes standing there trying to look sincere and complicated, it's hard not to make
devastating comparisons to Gwyneth Paltrow's performance in "Shakespeare in Love."
The big problem however that over-rides everything in this film is just a lack of emotional focus.
It's really hard to tell whether this film is trying to be a somewhat serious comedy or a strangely silly drama.
Surely a good summer stock performance would wring more laughs from the material than this somber production.
The actors seem embarrassed to be attempting humor, and unsure of where to place dramatic and comedic emphasis.
All of this is basically the fault of the director, Michael Radford, who seems to think that the material
is a great deal heavier than it appears to me."""
bad_review_tokens = analyzer(bad_review_text)
lda_model_n.get_document_topics(dictionary.doc2bow(bad_review_tokens))
list(set(bad_review_tokens) & set(tokens))
display(HTML(explore_opinions(bad_review_text, shared_tokens).to_html().replace("\\n","<br>").replace('adaptation', '<strong>adaptation</strong>')))
bb_text = """Drug wars, meth, the lot. I thought no thank you.
I kept hearing how good it was and I kept saying: "No thank you"
Last January I got sick, one of those illnesses you can't quite figure out.
Maybe it was pre and post election depression, I don't know. But I stayed in bed for almost
10 days and then it happened. I saw the first episode and I was immediately and I mean immediately,
hooked. I saw the entire series in 9 days. Voraciously. Now I had time to reflect. Why I wonder.
When I think about it the first thing that comes to mind is not a thing it's Bryan Cranston.
I know the concept was superb as was the writing but Bryan Cranston made it all real.
His performance, the creation of Walter White will be studied in the Acting classes of the future.
He is the one that pulls you forward - as well as backwards and sideways - then I realized that his
creation acquired the power that it acquired, in great part thanks to the extraordinary cast of supporting players.
I could write a page for each one of them but I'm just going to mention Aaron Paul.
I ended up loving him. I developed a visceral need to see him find a way out. Well, what can I tell you.
I know that one day, maybe when my kids are old enough, I shall see "Breaking Bad" again. I can't wait."""
bb_review_tokens = analyzer(bb_text)
lda_model_n.get_document_topics(dictionary.doc2bow(bb_review_tokens))
lda_model_n.show_topic(3)
bb_text_2 = """What do you get when you have a chemistry teacher in a mid life crisis, dying of cancer,
and washing cars as a second job to make ends meet for his middle class family? One of the greatest television
dramas of all time with crazy plot twists, brilliant performances, and unforgettable characters and cinematography.
There is so much to like about the masterpiece that is Breaking Bad. Take your pick: the acting,
the writing, the story lines, the plot, the suspense the cliff hangers, the action scenes, the camera work,
the characters, the character arcs, the realism, the satirical style, any season, the end, the casting, the
dark humor and humor relief, the scenery, the contrast between background and foreground to establish artistic
effect (the sun shiny clear blue skies of the NM desert behind the gruesome organized crime and violence of the
underworld), the mixing of favorite genres (crime caper, dark comedy, western, noir, horror, suspense, action,
drama, thriller, Shakespearean tragedy, dystopia, psychological character study..), the lines/quotes...
the list goes on.
What's amazing about Breaking Bad is it begins so humble and quiet, and as it continues to let its' story unfold,
it explodes. It gets better and better each season until the end in the final season, we don't know if we're watching a
television show or an Academy Award winning motion picture. The show dares to go where no one would have thought
it would go- into a transcendent realm of classic cinema- and it pulls it off beautifully."""
bb_review_tokens = analyzer(bb_text_2)
lda_model_n.get_document_topics(lda_model_n.id2word.doc2bow(bb_review_tokens))
print(list(set(bb_review_tokens) & set(get_topic_tokens(lda_model_n, 10))))
lda_model_n.show_topic(10)
print(list(set(bb_review_tokens) & set(get_topic_tokens(lda_model_n, 9))))
lda_model_n.show_topic(9)
print(list(set(bb_review_tokens) & set(get_topic_tokens(lda_model_n, 1))))
lda_model_n.show_topic(1)